Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


7.3.5.2 Using Machine Learning Classifiers

Machine learning taxonomy classifiers use a trained model to assign taxa to the
representative sequences rather than relying on an alignment approach. A classifier requires a
benchmark training dataset with known taxa for model training. With QIIME2, any of the
machine learning methods available in scikit-learn can be used to train a classifier for
taxonomy assignment. However, there are also pre-fitted classifiers that can be used instead.
In the following, we will first use a pre-fitted classifier for the taxonomy assignment, and
later we will train a new model and use it as well.

For a pre-fitted classifier, we can use the “classify-sklearn” method of the
“q2-feature-classifier” plugin to assign taxa to the representative sequences obtained from
clustering or denoising. First, we need to download a pre-fitted classifier. We can use a
classifier pre-trained on the GreenGenes 99% OTUs with the Naive Bayes machine learning
method. The pre-fitted classifiers are available on the QIIME2 website at
“https://docs.qiime2.org/2022.2/data-resources/”. Create the subdirectory “classifiers” and
download the classifier artifact into it as follows:

mkdir classifiers
wget -O "classifiers/gg-nb-99-classifier.qza" \
    "https://data.qiime2.org/2021.11/common/gg-13-8-99-nb-classifier.qza"
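
Re-running the download command above fetches the whole file again each time. As a minimal sketch, assuming “wget” is installed, the helper below skips the download when the target file already exists and is non-empty. The function name “fetch_once” and the “.part” suffix are illustrative choices, not part of QIIME2.

```shell
# Hypothetical helper: download a file only if it is not already present.
# $1: URL to fetch
# $2: destination path
fetch_once() {
    url="$1"
    dest="$2"
    if [ -s "$dest" ]; then
        echo "skip: $dest already exists"
        return 0
    fi
    mkdir -p "$(dirname "$dest")"
    # Download to a temporary name first so that an interrupted transfer
    # is never mistaken for a complete file.
    wget -O "$dest.part" "$url" && mv "$dest.part" "$dest"
}
```

It could be called with the same URL and destination as above, for example: fetch_once "https://data.qiime2.org/2021.11/common/gg-13-8-99-nb-classifier.qza" "classifiers/gg-nb-99-classifier.qza". Afterwards, “qiime tools peek classifiers/gg-nb-99-classifier.qza” should report the artifact type. Pre-fitted classifiers are generally tied to the scikit-learn version of the QIIME2 release they were built with, so it is safest to download the file that matches the installed release.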

Once the download has completed, use that classifier artifact as an input for the
“classify-sklearn” method together with the representative sequence artifact generated in the
clustering or denoising step. In the following, we will assign taxa to the representative
sequences generated by DADA2:

qiime feature-classifier classify-sklearn \
    --i-classifier classifiers/gg-nb-99-classifier.qza \
    --i-reads dada2/rep-seqs_yoga_dada2.qza \
    --o-classification taxonomy/nb_tax_yoga_dada2.qza
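
To get a quick overview of the assignments, the taxonomy artifact can be exported with “qiime tools export --input-path taxonomy/nb_tax_yoga_dada2.qza --output-path taxonomy/nb_export”, which writes a tab-separated “taxonomy.tsv” file with the columns Feature ID, Taxon, and Confidence. The following is a minimal sketch, assuming GreenGenes-style rank prefixes such as “p__” in the Taxon column; the function name “count_by_rank” and the export directory “taxonomy/nb_export” are hypothetical.

```shell
# Hypothetical helper: count features per taxon at one rank.
# $1: path to an exported taxonomy.tsv (Feature ID, Taxon, Confidence)
# $2: GreenGenes-style rank prefix, e.g. p__ for phylum or g__ for genus
count_by_rank() {
    awk -F '\t' -v prefix="$2" '
        NR == 1 { next }                 # skip the header row
        {
            label = "Unassigned"
            n = split($2, ranks, /; */)
            for (i = 1; i <= n; i++) {
                # keep the rank only if a name follows the prefix
                if (index(ranks[i], prefix) == 1 && length(ranks[i]) > length(prefix)) {
                    label = ranks[i]
                }
            }
            counts[label]++
        }
        END { for (t in counts) printf "%d\t%s\n", counts[t], t }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

For example, count_by_rank taxonomy/nb_export/taxonomy.tsv p__ would list the number of representative sequences per phylum, most frequent first. Sequences with no name at the requested rank are grouped under “Unassigned”.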

Instead of using a pre-fitted one, we can train a classifier using the “feature-classifier”
plugin, which has two methods for model fitting: “fit-classifier-naive-bayes” for training
a Naive Bayes classifier and “fit-classifier-sklearn” for training any scikit-learn
classifier.

Next, we will train a Naive Bayes classifier using GreenGenes reference sequences and
then we will use the fitted classifier to assign taxa to the representative sequences generated
by a previous clustering or denoising step.

To train any classifier, we need a training dataset with known labels. In the case
of taxonomy classification, we need representative sequences with known taxa. For our
example, we can use the GreenGenes 13_8 97% OTU dataset. Remember that we downloaded
the GreenGenes database before and stored it in the “gg_13_8_otus” subdirectory. We will use
the representative sequences “gg_13_8_otus/rep_set/97_otus.fasta” and their corresponding
taxonomic classifications “gg_13_8_otus/taxonomy/97_otu_taxonomy.txt”. Since the